[DO NOT MERGE] CI testing #20

Closed
wants to merge 1 commit into from

Conversation

xwang233
Collaborator

This reverts commit 1530744.

@xwang233
Collaborator Author

!build

29 similar comments

@xwang233 xwang233 force-pushed the ci_test_a_branch_that_fails_to_build branch from 074c862 to 3b5a25a Compare March 7, 2024 08:44
@xwang233
Collaborator Author

xwang233 commented Mar 7, 2024

!build

@xwang233 xwang233 closed this Mar 8, 2024
@xwang233 xwang233 deleted the ci_test_a_branch_that_fails_to_build branch March 8, 2024 02:00
jacobhinkle added a commit that referenced this pull request Mar 22, 2024
This introduces a thread-local global memory allocator for each device
and uses it whenever there is an intermediate tensor needed which
requires zero-initialization.

To enable use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the
allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high
confidence that every kernel using zero-initialized memory will always
clean up its semaphores. This is currently only the case for serial
grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs first with serial grid reduction, then with `gridReduce`,
and performs two grid reductions each time. Both serial grid reductions
succeed because the semaphore buffer is properly zeroed. The first
`gridReduce` call succeeds because the memory pool calls `at::zeros`
again to request a larger buffer (`gridReduce` requires more semaphores,
since there is one per thread segment rather than one per block
segment). However, the second `gridReduce` call fails because the first
call did not clean up its semaphores. Hacking that function to force
`PERSISTENT=1` would clean up the semaphores, resulting in success in
this case. I'm leaving those kinds of modifications for a follow-up.
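
The failure sequence described above can be reproduced with a small host-side model. This is only a sketch under stated assumptions, not the code in `global_allocator.cpp`: the names `ZeroedArena` and `launch` are hypothetical, a `std::vector` stands in for device memory, growing the arena simply swaps in a fresh all-zero buffer (playing the role of the `at::zeros` call), and the zero check is assumed to happen when a range is handed out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model of a reusable zero-initialized arena.
// None of these names are taken from nvFuser.
class ZeroedArena {
 public:
  // Returns the starting offset of a range of `bytes` bytes that is known to
  // be zero. Throws if a previous user left the range dirty.
  std::size_t allocate(std::size_t bytes) {
    const std::size_t begin = allocated_;
    const std::size_t end = begin + bytes;
    if (end > storage_.size()) {
      // Simplification: growing replaces the buffer with a fresh all-zero
      // one and drops the old contents.
      storage_.assign(end, 0);
    }
    const std::uint8_t* first = storage_.data() + begin;
    const bool dirty = std::any_of(
        first, first + bytes, [](std::uint8_t b) { return b != 0; });
    if (dirty) {
      throw std::runtime_error("arena range was not zeroed by its last user");
    }
    allocated_ = end;
    return begin;
  }

  // Makes the whole arena available again WITHOUT clearing it; the users of
  // the memory are expected to have zeroed it themselves.
  void reset() { allocated_ = 0; }

  std::uint8_t* data() { return storage_.data(); }
  std::size_t capacity() const { return storage_.size(); }

 private:
  std::vector<std::uint8_t> storage_;
  std::size_t allocated_ = 0;
};

// Simulates one kernel launch that needs `bytes` of semaphore storage.
// Returns false if the arena refused the request because it was dirty.
bool launch(ZeroedArena& arena, std::size_t bytes, bool cleans_up) {
  bool ok = true;
  try {
    const std::size_t off = arena.allocate(bytes);
    assert(off + bytes <= arena.capacity());
    // Semaphores are written while the kernel runs.
    std::fill_n(arena.data() + off, bytes, std::uint8_t{1});
    if (cleans_up) {
      // A well-behaved kernel restores its semaphores to zero before exiting.
      std::fill_n(arena.data() + off, bytes, std::uint8_t{0});
    }
  } catch (const std::runtime_error&) {
    ok = false;
  }
  arena.reset();
  return ok;
}
```

In this model, two 512-byte launches with `cleans_up = true` both succeed and reuse the same buffer. A 16384-byte launch with `cleans_up = false` also succeeds, because growing the arena yields fresh zeroed storage. A second 16384-byte launch then returns `false`, since the range is reused as-is and still holds the nonzero values left behind by the first.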
@xwang233 xwang233 reopened this Aug 9, 2024
@xwang233
Collaborator Author

xwang233 commented Aug 9, 2024

!build --pybench

@xwang233 xwang233 closed this Aug 9, 2024
@xwang233 xwang233 restored the ci_test_a_branch_that_fails_to_build branch October 30, 2024 21:02
@xwang233 xwang233 reopened this Oct 30, 2024
@xwang233 xwang233 closed this Oct 30, 2024
@xwang233 xwang233 force-pushed the ci_test_a_branch_that_fails_to_build branch from 5c14e7c to bad9e50 Compare October 30, 2024 21:05
@xwang233 xwang233 deleted the ci_test_a_branch_that_fails_to_build branch October 30, 2024 21:06
@xwang233 xwang233 restored the ci_test_a_branch_that_fails_to_build branch October 31, 2024 18:45
@xwang233 xwang233 reopened this Oct 31, 2024
@xwang233
Collaborator Author

!test --pybench

@xwang233
Collaborator Author

!test

@xwang233 xwang233 closed this Oct 31, 2024
@xwang233 xwang233 deleted the ci_test_a_branch_that_fails_to_build branch October 31, 2024 21:37
@xwang233 xwang233 restored the ci_test_a_branch_that_fails_to_build branch November 14, 2024 05:23
@xwang233 xwang233 reopened this Nov 14, 2024
@xwang233
Collaborator Author

!build --dev

@xwang233
Collaborator Author

!build

@xwang233 xwang233 closed this Nov 14, 2024
@xwang233 xwang233 deleted the ci_test_a_branch_that_fails_to_build branch November 14, 2024 05:35